Analyzing the 2020 US Democratic Presidential Primary on Reddit

by Steven Espinoza

In the weeks leading up to the Iowa Caucuses on February 3, 2020, the candidates for the Democratic presidential nomination seemed to have finally stabilized their positions in the polls. As of mid-January, the top four contenders appear to be Bernie Sanders (who currently leads in both IA and NH), Joe Biden, Elizabeth Warren, and Pete Buttigieg.

This Jupyter notebook serves three purposes: 1) to analyze how the conversation around the Democratic presidential primary has played out on Reddit, 2) to show how to make good use of the PRAW package and the Pushshift API to mine valuable data, and 3) to demonstrate how to build interactive visualizations of this data using the most recent version of Plotly (4.4.1). PRAW and Pushshift were created specifically for Reddit; their documentation can be found here and here, respectively. Plotly is an open-source library popular in both Python and R, and a good tutorial on how to integrate it within a Jupyter Notebook can be found here.

For the purposes of this notebook, we will only focus on the four candidates mentioned above. Viewers of this notebook should feel free to download the notebook and use the functions I've created to play with other candidates as well.

Import Statements and Setup

Before moving forward, we will first install some packages for basic data extraction, manipulation, and visualization.

In [1]:
### for data manipulation, data structure, cleaning, etc.
import pandas as pd 
import numpy as np
from datetime import datetime
import random
import re

### for API requests and reddit data
import requests
import praw

### for data visualization (with plotly 4.4.1)
import chart_studio.plotly as py
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from wordcloud import WordCloud

Next, we need to set up an instance of the Reddit class. To responsibly make use of Reddit's API, we must first log into our Reddit accounts, go to the Reddit apps page, and create an application by clicking the button on the bottom-left of the page.

At this point, you will give your application a name and a description; for instance, you can name it something like "democrats2020". It doesn't much matter which kind of app you create, though the "personal script" option seems like the best fit. The PRAW documentation also suggests entering http://localhost:8080 in the redirect uri box.

Upon creating the application, we will get our client_id, our client_secret, and our user_agent:

  • The client_id is shown a couple of lines below the name for your application.
  • The client_secret is the 27-character key next to "secret."
  • Finally, the user_agent is an identifying string for your client; your Reddit username works fine here.

Make sure to fill in the cell below with your respective keys.

In [2]:
### setting up reddit client with client_id, client_secret, and user_agent provided on reddit.com/dev/apps
reddit = praw.Reddit(client_id='YOUR CLIENT ID',
                     client_secret='YOUR CLIENT SECRET',
                     user_agent='YOUR REDDIT USERNAME')
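As a side note of my own, it's good practice not to hardcode secrets in a notebook. One option is reading them from environment variables (the variable names below are just my choice); PRAW can also pick up credentials from a praw.ini file, as described in its configuration docs:

```python
import os

### reading credentials from environment variables so secrets never appear in the notebook
### (REDDIT_CLIENT_ID etc. are names I chose here -- set them however you like)
client_id = os.environ.get('REDDIT_CLIENT_ID', 'YOUR CLIENT ID')
client_secret = os.environ.get('REDDIT_CLIENT_SECRET', 'YOUR CLIENT SECRET')
user_agent = os.environ.get('REDDIT_USER_AGENT', 'YOUR REDDIT USERNAME')
```

These three values can then be passed to praw.Reddit() exactly as in the cell above.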

Finally, to make life much easier later on, I created some initial functions to easily convert Unix time to datetime, and datetime back to Unix time. The reasons for this will become apparent once we start working with the Pushshift API.

In [3]:
def unix_to_dt(unix):
    """
    params: 
    unix: an INT representing a unix timestamp
    
    returns:
    datetime: a STRING object in year-month-day hour:minute:second format
    """
    return datetime.utcfromtimestamp(unix).strftime('%Y-%m-%d %H:%M:%S')
In [4]:
def dt_to_unix(year, month, day, hour = 0, minute = 0, second = 0):
    """
    params: year, month, day, hour, minute, second (all INT values)
    
    returns: the Unix timestamp (FLOAT), with the naive datetime interpreted in the machine's local timezone (EST here)
    """
    return datetime(year, month, day, hour, minute, second).timestamp()
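A quick sanity check on these helpers (my own addition): converting the timestamp we'll compute shortly for January 1, 2019 back to a string shows a five-hour offset, because dt_to_unix interprets the naive datetime in the machine's local timezone (EST here) while unix_to_dt reports UTC:

```python
from datetime import datetime

### 1546318800 is the timestamp for 2019-01-01 00:00:00 EST (computed on my machine)
utc_string = datetime.utcfromtimestamp(1546318800).strftime('%Y-%m-%d %H:%M:%S')
print(utc_string)
# → 2019-01-01 05:00:00  (midnight EST is 5:00 AM UTC)
```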

Part 1: Analyzing Volume by Candidate

As a first question, we might want to simply assess the volume of Reddit comments that mention a specific candidate. The function I created below, get_volume, uses the Pushshift API to aggregate the number of comments matching a given query on a day-by-day basis.

For simplicity, we would like to restrict the searches to comments made after January 1, 2019. We can filter for this using the "after" parameter described in the Pushshift API documentation, which expects a Unix timestamp (seconds since the epoch, in UTC). Therefore, we first find the Unix timestamp for January 1, 2019:

In [5]:
### getting UTC for January 1, 2019
dt_to_unix(2019, 1, 1)
Out[5]:
1546318800.0

I then plug this value (cast to an integer) into the "after" parameter included within the link used in the get_volume function below. Note that this will stay the same for all of the candidates included.

In [6]:
def get_volume(query):
    """
    params: 
    query: a STRING value representing a query
    
    returns:
    df: a PANDAS DATAFRAME showing the number of Reddit comments since 1/1/2019 that include query
    """
    ### first, getting queries:
    # after: 1/1/2019 (UTC = 1546318800)
    # aggs: aggregating by day
    link = """https://api.pushshift.io/reddit/search/
    comment/
    ?q={}
    &after=1546318800
    &aggs=created_utc
    &frequency=day
    """.replace('\n', '').replace(' ', '').format(query)
    
    ### making an HTTP request to the link
    r = requests.get(link)
    
    ### converting the response into a Python-readable JSON format
    json = r.json()
    
    ### combining all of the information into a pandas DataFrame
    df = pd.DataFrame({'date': [unix_to_dt(i['key']) for i in json['aggs']['created_utc']],
                       'candidate': [query] * len(json['aggs']['created_utc']),
                       'count_comments': [i['doc_count'] for i in json['aggs']['created_utc']],
                      }) 
    
    ### changing the type of "date"
    df.date = df.date.astype('datetime64[ns]')
    
    ### finally, returning the dataframe
    return df
In [7]:
### example of get_volume(query):
get_volume('Pete Buttigieg').sample(5)
Out[7]:
date candidate count_comments
170 2019-06-21 Pete Buttigieg 52
137 2019-05-19 Pete Buttigieg 50
241 2019-08-31 Pete Buttigieg 18
328 2019-11-26 Pete Buttigieg 334
116 2019-04-28 Pete Buttigieg 86

To combine all of the data of our four candidates into a single dataframe, we will use Pandas' DataFrame.append functionality:

In [8]:
### create a dataframe with aggregated numbers of comments containing candidate names
comment_aggs = get_volume('Joe Biden')\
.append(get_volume('Elizabeth Warren'))\
.append(get_volume('Bernie Sanders'))\
.append(get_volume('Pete Buttigieg'))
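One compatibility note of my own: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions the same combination is done with pd.concat. A toy sketch of the pattern (with small stand-in frames in place of the get_volume results):

```python
import pandas as pd

### toy stand-ins for two get_volume() results, just to show the concat pattern
biden = pd.DataFrame({'date': ['2019-01-01'], 'candidate': ['Joe Biden'], 'count_comments': [100]})
warren = pd.DataFrame({'date': ['2019-01-01'], 'candidate': ['Elizabeth Warren'], 'count_comments': [80]})

### equivalent to biden.append(warren), on any modern pandas version
comment_aggs = pd.concat([biden, warren], ignore_index=True)
```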

Finally, we will use Plotly to visualize the trends over time in a modern, interactive fashion.

In [9]:
### first, setting the layout of the new chart
layout = go.Layout(
    title={"text": "Analyzing Volume over Time (Reddit Comments)",  ### setting title
           'x': 0.5, ### title x-position
           'y': 0.9, ### title y-position
           'xanchor': 'center'}, ### anchored center
    xaxis=dict(title='Date'), ### setting xlabel
    yaxis=dict(title='Number of Reddit Comments'), ### setting ylabel
    hovermode='x') ### allows hovering over multiple lines

### then, initializing the figure using the layout outlined above
fig = go.Figure(layout=layout)

### then, adding the traces to the figure for each candidate
for i in ['Pete Buttigieg', 'Elizabeth Warren', 'Bernie Sanders', 'Joe Biden']:
    fig.add_trace(go.Scatter(
                x=comment_aggs.date, ### x-values: date
                y=comment_aggs[comment_aggs.candidate == '%s' %(i)].count_comments,  ### y-values: sum of comments
                name="%s" %(i))) ### labels: candidate names

### finally, showing the plot
fig.show()

It seems that by far, the most popular candidate discussed on Reddit is Bernie Sanders. Joe Biden seems to follow closely behind (with some rare moments where he dominates the conversation), while Warren and Buttigieg follow behind Biden.

To understand how these candidates are being talked about on Reddit, it might be helpful to analyze which subreddits most often mention them in their comment sections.

Analyzing Discussion by Subreddit

In [10]:
def get_top_subreddits(query, top = 10):
    """
    params: 
    query: a STRING value representing a query
    top: an INT value representing the number of top subreddits to spit out (default=10)
    
    returns:
    df: a PANDAS DATAFRAME showing the number of Reddit comments since 1/1/2019 with a specific subreddit
    """
    ### first, getting queries:
    # after: 1/1/2019 (UTC = 1546318800)
    # aggs: aggregating by day
    link = """https://api.pushshift.io/reddit/search/
    comment/
    ?q={}
    &after=1546318800
    &aggs=subreddit
    &frequency=day
    """.replace('\n', '').replace(' ', '').format(query)
    
    ### making an HTTP request to the link
    r = requests.get(link)
    
    ### converting the response into a Python-readable JSON format
    json = r.json()
    
    ### combining all of the information into a pandas DataFrame
    df = pd.DataFrame({'candidate': [query] * len(json['aggs']['subreddit'][:top]),
                       'subreddit': [i['key'] for i in json['aggs']['subreddit'][:top]],
                       'count_comments': [i['doc_count'] for i in json['aggs']['subreddit'][:top]],
                      }) 
    
    ### finally, returning the dataframe
    return df
In [11]:
### example: getting the top 7 subreddits that mention Joe Biden
get_top_subreddits('Joe Biden', 7)
Out[11]:
candidate subreddit count_comments
0 Joe Biden politics 58563
1 Joe Biden The_Donald 10456
2 Joe Biden worldnews 6345
3 Joe Biden ChapoTrapHouse 4832
4 Joe Biden neoliberal 4211
5 Joe Biden SandersForPresident 3883
6 Joe Biden AskReddit 3344
In [12]:
### first, initializing the figure
fig = go.Figure()

### then, adding the four traces
for i in ['Joe Biden', 'Pete Buttigieg', 'Bernie Sanders', 'Elizabeth Warren']:
    fig.add_trace(go.Bar(x=get_top_subreddits('%s' %(i)).subreddit,
                         y=get_top_subreddits('%s' %(i)).count_comments,
                         name='%s' %(i),
                         visible=False ### important to make sure that the first plot isn't an aggregated sum
                        ))

### then, adding the button functionality
fig.update_layout(
    updatemenus=[go.layout.Updatemenu(buttons=list([
        dict(args=["visible", [True, False, False, False]],  ### this is important: it shows which of the 
                                                             ### four traces to show when the button is clicked
             label="Joe Biden",
             method="restyle"),
        dict(args=["visible", [False, True, False, False]],
             label="Pete Buttigieg",
             method="restyle"),
        dict(args=["visible", [False, False, True, False]],
             label="Bernie Sanders",
             method="restyle"),
        dict(args=["visible", [False, False, False, True]],
             label="Elizabeth Warren",
             method="restyle")]),
      direction="down",
      showactive=False,
      xanchor="left",
      yanchor="top",
      x = 0.25, y=1.2)])

### adding an annotation
fig.update_layout(
    annotations=[
        go.layout.Annotation(text="<b>Select a Democratic Candidate</b>", x=-0.01, 
                             xref="paper", y=1.18, yref="paper",
                             align="left", showarrow=False)
    ])

### finally, showing the plot
fig.show()

From the above, it seems (unsurprisingly) that r/politics is a favorite forum for discussion of all the candidates. A better graph might show how much the candidates are discussed beyond r/politics.

In [13]:
### first, initializing the figure
fig = go.Figure()

### then, adding the four traces
for i in ['Joe Biden', 'Pete Buttigieg', 'Bernie Sanders', 'Elizabeth Warren']:
    fig.add_trace(go.Bar(x=get_top_subreddits('%s' %(i))[get_top_subreddits('%s' %(i)).subreddit 
                                                         != 'politics'].subreddit,
                         y=get_top_subreddits('%s' %(i))[get_top_subreddits('%s' %(i)).subreddit 
                                                         != 'politics'].count_comments,
                         name='%s' %(i),
                         visible=False ### important to make sure that the first plot isn't an aggregated sum
                        ))

### then, adding the button functionalities
fig.update_layout(
    updatemenus=[go.layout.Updatemenu(buttons=list([
        dict(args=["visible", [True, False, False, False]],
             label="Joe Biden",
             method="restyle"),
        dict(args=["visible", [False, True, False, False]],
             label="Pete Buttigieg",
             method="restyle"),
        dict(args=["visible", [False, False, True, False]],
             label="Bernie Sanders",
             method="restyle"),
        dict(args=["visible", [False, False, False, True]],
             label="Elizabeth Warren",
             method="restyle")]),
      direction="down",
      showactive=False,
      xanchor="left",
      yanchor="top",
      x = 0.25, y=1.2)])

### adding an annotation
fig.update_layout(
    annotations=[
        go.layout.Annotation(text="<b>Select a Democratic Candidate</b>", x=-0.01, 
                             xref="paper", y=1.18, yref="paper",
                             align="left", showarrow=False)
    ])

### finally, showing the plot
fig.show()

Beyond r/politics, it seems that much of the discussion around Bernie Sanders is driven by some of the pro-Sanders subreddits, such as r/SandersForPresident, r/WayOfTheBern, and r/ChapoTrapHouse. Interestingly, both Joe Biden and Elizabeth Warren seem to have been popular on the r/The_Donald subreddit, a pro-Trump subreddit that was "quarantined" in the summer of 2019 for threats of violence. Pete Buttigieg, unsurprisingly, draws a large share of discussion about him on the r/Pete_Buttigieg subreddit beyond r/politics.

Analyzing Topics with Wordclouds

Moving forward, it would be interesting to see what kinds of topics come up with regard to each candidate, and also to analyze how the topics surrounding each candidate may differ given a specific subreddit. With word clouds, we might get a good idea.

It's important to note here that the Pushshift API only returns 25 posts by default. We can increase the limit by adding a size parameter to the link we use; however, even then, the largest value this method accepts is 500.
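For instance, a request for up to 500 comments would be built like this (this snippet only constructs the URL string; making the request works exactly as in get_volume above):

```python
### building a Pushshift comment-search URL with an explicit size parameter
### (same base URL and "after" value as in get_volume)
def build_comment_url(query, size=500):
    return ('https://api.pushshift.io/reddit/search/comment/'
            '?q={}&after=1546318800&size={}'.format(query, size))

print(build_comment_url('Biden'))
```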

I discuss how we could possibly get around this in the "Future Work" section below, though it raises questions about how to use the Pushshift API responsibly. In the meantime, this is where the PRAW package comes in: it lets us make better sense of how submissions and comments are related to one another, and of how the candidates are actually being discussed in the Reddit comments.

The code below shows how to use the search() method; it returns the submissions that include a phrase, which in our example below is Elizabeth Warren.

In [14]:
### example: showing how the search method works (with limit = 250)
print(list(reddit.subreddit('all').search('Elizabeth Warren', limit=250)))
[Submission(id='cq9j51'), Submission(id='eop4w9'), Submission(id='bgfpzu'), Submission(id='eokxr3'), Submission(id='ck7rdw'), Submission(id='ab7htt'), Submission(id='eousl5'), Submission(id='cwppzq'), Submission(id='epstir'), Submission(id='eo8ctx'), Submission(id='eoq0hz'), Submission(id='eo1t6z'), Submission(id='eqhut0'), Submission(id='ekvy4p'), Submission(id='enhscr'), Submission(id='eq0bkd'), Submission(id='ep3ybx'), Submission(id='eejgt2'), Submission(id='eoq84x'), Submission(id='ep4sjm'), Submission(id='eodj0y'), Submission(id='eq3mo3'), Submission(id='dv9nlr'), Submission(id='ecwncz'), Submission(id='eo9yak'), Submission(id='eod9mr'), Submission(id='ep37by'), Submission(id='ebe6gf'), Submission(id='ed9m6n'), Submission(id='ep3hgo'), Submission(id='e5t5xj'), Submission(id='ep4sdd'), Submission(id='elvblt'), Submission(id='emfdxc'), Submission(id='e3h86e'), Submission(id='ellozq'), Submission(id='ebvy16'), Submission(id='epc1eg'), Submission(id='ep54lb'), Submission(id='ep3e7n'), Submission(id='ekxvp9'), Submission(id='ep9t95'), Submission(id='ej74np'), Submission(id='em52mj'), Submission(id='emsc02'), Submission(id='ealj7z'), Submission(id='dkjho2'), Submission(id='ep8m4z'), Submission(id='eo7t7v'), Submission(id='elems8'), Submission(id='eqe4cq'), Submission(id='dvc3lf'), Submission(id='er0u7k'), Submission(id='ecd69r'), Submission(id='eorirm'), Submission(id='eldgfj'), Submission(id='emzoce'), Submission(id='eodknb'), Submission(id='eqk1co'), Submission(id='egd9c2'), Submission(id='ep34vj'), Submission(id='d7is6i'), Submission(id='eo7vrf'), Submission(id='ep3fjs'), Submission(id='ep4ifr'), Submission(id='epuokj'), Submission(id='dq3qpv'), Submission(id='eob1uz'), Submission(id='epd8y5'), Submission(id='eoy2ap'), Submission(id='eoarjv'), Submission(id='dq6kvs'), Submission(id='dvaejl'), Submission(id='dcudqe'), Submission(id='eo7rl0'), Submission(id='eqxgg9'), Submission(id='e318u7'), Submission(id='een2wn'), Submission(id='e2fa9e'), 
Submission(id='eeqt19'), Submission(id='eorvec'), Submission(id='epbztd'), Submission(id='ejejp2'), Submission(id='epvepj'), Submission(id='ep1hh6'), Submission(id='eqai9u'), Submission(id='ducnw7'), Submission(id='e5lier'), Submission(id='eij7m4'), Submission(id='dy65h4'), Submission(id='eol3iy'), Submission(id='epd85z'), Submission(id='eoyjm5'), Submission(id='diry0d'), Submission(id='eqef27'), Submission(id='eoxx7m'), Submission(id='ea6yh1'), Submission(id='epja4p'), Submission(id='eq2mma'), Submission(id='elw43g'), Submission(id='emrefm'), Submission(id='ei1ywi'), Submission(id='ep6xxi'), Submission(id='e6zbck'), Submission(id='e1f9ea'), Submission(id='eom1lv'), Submission(id='enw689'), Submission(id='elfwsj'), Submission(id='enwdzw'), Submission(id='eh04gw'), Submission(id='eelyh9'), Submission(id='eonbiu'), Submission(id='dv0ap9'), Submission(id='epj9cm'), Submission(id='ej2jhm'), Submission(id='ejft10'), Submission(id='dhci5l'), Submission(id='eojtm4'), Submission(id='ep663u'), Submission(id='dtwp0k'), Submission(id='enwomq'), Submission(id='elamxp'), Submission(id='el68vv'), Submission(id='eneima'), Submission(id='edcngw'), Submission(id='eon8vs'), Submission(id='eodn7m'), Submission(id='eld1qv'), Submission(id='ep4lta'), Submission(id='dtjbvj'), Submission(id='eb8682'), Submission(id='efj19f'), Submission(id='eoelqh'), Submission(id='eq19dk'), Submission(id='eqlz7t'), Submission(id='eob4yi'), Submission(id='ep5kol'), Submission(id='epcr87'), Submission(id='ejgf31'), Submission(id='eok046'), Submission(id='eo81bg'), Submission(id='ent5yo'), Submission(id='ee0ure'), Submission(id='d4yl2n'), Submission(id='ebg7r0'), Submission(id='d4jhvf'), Submission(id='eea8wo'), Submission(id='epzx52'), Submission(id='e6fzp6'), Submission(id='elbuz6'), Submission(id='d53xzz'), Submission(id='epqdp8'), Submission(id='eg006u'), Submission(id='e7udjn'), Submission(id='edatpy'), Submission(id='eoz2ok'), Submission(id='eejpxy'), Submission(id='eqfyms'), Submission(id='dmlg06'), 
Submission(id='dp6uew'), Submission(id='ep3qhg'), Submission(id='eqtduh'), Submission(id='e6x62a'), Submission(id='emi6ax'), Submission(id='duwepx'), Submission(id='ej07n9'), Submission(id='elwhvx'), Submission(id='dfarax'), Submission(id='egdzdh'), Submission(id='cszqv9'), Submission(id='eoslbh'), Submission(id='egxild'), Submission(id='egdtot'), Submission(id='eqh1ul'), Submission(id='eo7q8e'), Submission(id='duuneb'), Submission(id='dj8538'), Submission(id='egda4f'), Submission(id='dvb90l'), Submission(id='edktae'), Submission(id='elscwc'), Submission(id='ei7hgt'), Submission(id='eo9dih'), Submission(id='efa8y3'), Submission(id='dgosr9'), Submission(id='eqlc6q'), Submission(id='ephqus'), Submission(id='dzfu5h'), Submission(id='epp0wq'), Submission(id='dibkcr'), Submission(id='dina5s'), Submission(id='eplzpy'), Submission(id='enyc45'), Submission(id='ei2svq'), Submission(id='dl1v9j'), Submission(id='eebndu'), Submission(id='emfxt9'), Submission(id='emwgf0'), Submission(id='egum9w'), Submission(id='dudxkp'), Submission(id='dq5fwi'), Submission(id='epocjt'), Submission(id='emdn9q'), Submission(id='eoo6gl'), Submission(id='eotz0p'), Submission(id='eo9tbw'), Submission(id='eokkbt'), Submission(id='eog0of'), Submission(id='e93kb6'), Submission(id='e60p5z'), Submission(id='eoqveo'), Submission(id='e7kjw3'), Submission(id='een8qx')]

Each of the objects returned is of the Submission class, and all of its attributes can be found here. For instance, to find the link of the first Submission in the list above, we add .permalink to the first object in the list:

In [15]:
### getting the link to the reddit: reddit.com/.....
list(reddit.subreddit('all').search('Elizabeth Warren'))[0].permalink
Out[15]:
'/r/pics/comments/cq9j51/bernie_sanders_and_elizabeth_warren_flying_coach/'
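Note that permalink is relative to the site root; to get a clickable URL, just prepend Reddit's domain:

```python
### permalink values are relative paths; prepend reddit's domain for a full URL
permalink = '/r/pics/comments/cq9j51/bernie_sanders_and_elizabeth_warren_flying_coach/'
full_url = 'https://www.reddit.com' + permalink
print(full_url)
```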

What's particularly useful about searching for queries in this fashion is that we can use PRAW above to get the submissions that mention, e.g., Elizabeth Warren in the submission field, and then use each of the submission IDs to get all of the comments for each submission. I demonstrate this below for the first submission from above:

In [16]:
### getting all comments from the first submission
list(list(reddit.subreddit('all').search('Elizabeth Warren'))[0].comments)
Out[16]:
[Comment(id='ewupqqs'),
 Comment(id='ewurryw'),
 Comment(id='ewvcy40'),
 Comment(id='ewutgjk'),
 Comment(id='ewuua9c'),
 Comment(id='ewvutde'),
 Comment(id='ewv9e1x'),
 Comment(id='ewv0cbi'),
 Comment(id='ewv40ry'),
 Comment(id='ewv4zw6'),
 Comment(id='ewuuh2j'),
 Comment(id='ewv7ief'),
 Comment(id='ewux5k9'),
 Comment(id='ewvdwvh'),
 Comment(id='ewvpe55'),
 Comment(id='ewvb34p'),
 Comment(id='ewuybmw'),
 Comment(id='ewvcaf2'),
 Comment(id='ewvhnyj'),
 Comment(id='ewwo5xe'),
 Comment(id='ewv04d0'),
 Comment(id='ewvaoji'),
 Comment(id='ewvb0lp'),
 Comment(id='ewuuz5j'),
 Comment(id='ewv7nn1'),
 Comment(id='ewvf5e4'),
 Comment(id='ewv4n1o'),
 Comment(id='ewv6v7j'),
 Comment(id='ewvey9a'),
 Comment(id='ewvot07'),
 Comment(id='ewvt3tr'),
 Comment(id='ewvtl8x'),
 Comment(id='ewvu25p'),
 Comment(id='eww4vdo'),
 Comment(id='ewuyvp0'),
 Comment(id='ewvu5ur'),
 Comment(id='ewuwqkh'),
 Comment(id='ewuzji1'),
 Comment(id='ewuxx1j'),
 Comment(id='ewuxs55'),
 Comment(id='ewviltk'),
 Comment(id='ewvsmwy'),
 Comment(id='ewvxmvh'),
 Comment(id='ewvy109'),
 Comment(id='ewvycaq'),
 Comment(id='eww361w'),
 Comment(id='eww5jn7'),
 Comment(id='ewwgqfj'),
 Comment(id='ewv4ezb'),
 Comment(id='ewveh49'),
 Comment(id='ewv1g5b'),
 Comment(id='ewvhhn3'),
 Comment(id='ewvudwy'),
 Comment(id='ewvyc8g'),
 Comment(id='ewv10p5'),
 Comment(id='ewv7g5h'),
 Comment(id='ewvdpl7'),
 Comment(id='ewuv27o'),
 Comment(id='ewvb400'),
 Comment(id='ewvbvjj'),
 <MoreComments count=2244, children=['ewvmcjl', 'ewwbev6', 'ewvde6b', '...']>]

By this point, we come to an important realization about Reddit data: it is very messy! So far we have only considered submissions, but there are also comments, and each comment can be either a top-level ("parent") comment or a reply ("child"); each child can have children of its own, and so on, all the way down the thread.
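To make that parent/child structure concrete, here is a toy sketch of the recursion involved, using plain dicts as stand-ins for Comment objects (in PRAW itself, submission.comments.replace_more(limit=0) followed by .list() does this flattening for you):

```python
### toy comment tree: each "comment" is a dict with a body and a list of replies
tree = [
    {'body': 'parent A', 'replies': [
        {'body': 'child A1', 'replies': []},
        {'body': 'child A2', 'replies': [
            {'body': 'grandchild A2a', 'replies': []}]}]},
    {'body': 'parent B', 'replies': []},
]

def flatten(comments):
    """Recursively collect every comment body in the tree, depth-first."""
    out = []
    for c in comments:
        out.append(c['body'])
        out.extend(flatten(c['replies']))
    return out

print(flatten(tree))
# → ['parent A', 'child A1', 'child A2', 'grandchild A2a', 'parent B']
```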

Again, in this notebook I'm prioritizing simplicity over robustness, so here is the plan: for each query, we'll get at most 25 submissions that mention it. Then, out of all of these submissions, we'll take a random sample of 500 comments. Finally, I'll use the body attribute from PRAW's documentation to actually extract the text. I put this all together in a function called get_comments, implemented below:

In [17]:
def get_comments(query, subred = 'all'):
    """
    params:
    query: a STRING value indicating a query
    subred: a STRING value indicating a subreddit name (e.g. The_Donald or SandersForPresident); 
               if none entered, automatically becomes "all"
               
    returns:
    comments: a LIST of comments
    """
    ### first, getting 25 latest submissions with the query
    submissions = list(reddit.subreddit(subred).search(query, limit = 25))
    
    ### then, getting all of the comment forest objects
    # (important: dropping the last object, since it is typically a MoreComments instance rather than a Comment)
    comment_forests = [list(submission.comments)[:-1] for submission in submissions]
            
    ### then, appending all of these comments into a single list
    all_comments = []
    
    for comments in comment_forests:
        for comment in comments:
            all_comments.append(comment)

    ### getting a random sample of 500 comments (keeping everything if there are fewer than 500)
    if len(all_comments) > 500:
        all_comments = random.sample(all_comments, 500)
    
    ### finally, returning a list of all the comment texts
    comment_texts = [comment.body for comment in all_comments if comment.body != '[deleted]']
    
    return comment_texts
In [18]:
### example: getting comments from submissions that mention Bernie Sanders
test = get_comments('Bernie Sanders')

Using the function created above, we can now create a new function, get_wordcloud, that would make good use of the function above to create wordclouds based on all of the words used in comments from submissions related to a specific query.

In [19]:
def get_wordcloud(query, subreddit = 'all'):
    """
    params:
    query: a STRING value representing a query
    subreddit: a STRING value representing a subreddit (e.g. "The_Donald" or "worldnews"); 
                type "all" for all subreddits
                
    returns:
    wordcloud: a WordCloud object representing a word cloud of topics associated with a candidate
    """
    ### first, call the `get_comments` functionality from above
    comments = get_comments(query, subreddit)
    
    ### then, combining all of these into a single string and setting everything to lowercase
    string = ' '.join(comments).lower()
    
    # removing anything within () and [] (usually denotes URLs in reddit data) with regular expressions
    string = re.sub(r"[\(\[].*?[\)\]]", " ", string)
    
    # also, removing anything that's not alphanumeric, while keeping spaces and apostrophes
    string = ''.join([i if (i.isalpha() or i.isnumeric() or i == '\'') else ' ' for i in string])
    
    # removing the query itself from all of the comments
    # (note: this is plain substring removal, so removing "pete" also strips it from e.g. "compete")
    for i in query.lower().split(' '):
        string = string.replace(i, ' ')
        
    # also removing more words ("stopwords") that commonly come up for all of the candidates
    more_stopwords = ['people', 'will', 'think', 'want', 'need', 'candidate', 'campaign',
                      'way', 'make', 'going', 'right', 'know', 'really', 'thing']
    
    for i in more_stopwords:
        string = string.replace(i, ' ')
    
    # finally, collapsing any runs of spaces into single spaces
    string = re.sub(' +', ' ', string)
    
    ### finally, generate the wordcloud
    wordcloud = WordCloud(background_color='white', width=2000, height=1000).generate(string)
    
    ### setting figure options, titles, labels, and axes
    plt.figure(figsize=[15, 5])
    plt.title("Query: %s" %(query), size=14)
    plt.text(1500, -30, "Subreddit: %s" %(subreddit) , bbox={'facecolor': 'red', 'alpha': 0.5, 'pad': 5})
    plt.axis("off")
    
    ### finally, returning the plot
    plt.imshow(wordcloud)
    
    return plt.show()
In [20]:
### example: getting a word cloud of pete buttigieg
get_wordcloud('Pete Buttigieg')
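To see what those cleaning steps actually do, here is the same pipeline in miniature on a made-up comment (a re-sketch of the logic inside get_wordcloud, not a separate feature):

```python
import re

def clean(text, query):
    """A miniature version of the text cleaning inside get_wordcloud."""
    text = text.lower()
    # drop anything inside () or [] (usually URLs in reddit markdown)
    text = re.sub(r"[\(\[].*?[\)\]]", " ", text)
    # keep only letters, digits, apostrophes, and spaces
    text = ''.join(c if (c.isalpha() or c.isnumeric() or c == "'") else ' ' for c in text)
    # drop the query terms themselves
    for term in query.lower().split():
        text = text.replace(term, ' ')
    # collapse runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()

print(clean("Check [this poll](https://example.com) -- Pete Buttigieg leads!", "Pete Buttigieg"))
# → check leads
```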

We can now use the get_wordcloud function to see what is said about each candidate within a specific subreddit. For instance, I'm curious to see which words come up most often for Bernie Sanders in the SandersForPresident subreddit versus the The_Donald subreddit:

In [21]:
### Bernie Sanders discussion from SandersForPresident...
get_wordcloud('Bernie Sanders', 'SandersForPresident')
In [22]:
### ...and from The_Donald
get_wordcloud('Bernie Sanders', 'The_Donald')

We could also see how discussion of Elizabeth Warren differs across different subreddits. I'm especially interested in seeing how she's discussed in pro-Bernie versus pro-Trump subreddits, though whether there is any clear distinction to draw from each of the subreddits is debatable.

In [23]:
### Elizabeth Warren discussion from WayOfTheBern...
get_wordcloud('Elizabeth Warren', 'WayOfTheBern')
In [24]:
### ...and from The_Donald
get_wordcloud('Elizabeth Warren', 'The_Donald')

Conclusions and Future Work

This notebook gives only a peek at the full range of insights and functionality we can tap into with the PRAW/Pushshift packages. There are still many more ways to draw insight from the unstructured data that Reddit spits out:

  • First, the limits on results can seem a bit restricting at first. PRAW in particular imposes strict rate limits on how many HTTP requests can be made within a certain window (details can be found in the PRAW documentation). The main way to work within this limit is to be as specific as possible about the parameters included in the search requests. In other words, since each request returns only a limited number of results, we need to make many narrower requests. We could do this by, for example, varying the "created_utc" parameter so that we fetch, say, 250 results for every 24-hour period, or by setting the subreddit parameter to a variety of subreddits so that each subreddit we include contributes its own batch of results.

  • Second, as shown above, it is actually quite difficult to gauge sentiment from the words included in the wordclouds. The NLTK package in Python is great at creating very simple (maybe too simple) sentiment scores. Combining scores from NLTK with the number of "upvotes" or "downvotes" a specific comment receives (which we can also extract with PRAW) could be useful in gauging sentiment toward the candidates. Additionally, framing this as a supervised machine learning exercise with sklearn to assign sentiment to Reddit comments may also prove useful, though it would first require a real human to label the training data. Finally, a time element would also be interesting--how has sentiment around Bernie Sanders in r/The_Donald changed over the past few months? What about Elizabeth Warren within the pro-Sanders subreddits?

  • Finally, as also shown above, it is quite difficult to decipher exactly which topics are most commonly discussed in reference to the specific Democratic candidates. There is a large literature on solving this problem in Python using topic modeling. Specifically, Latent Dirichlet Allocation (LDA) has been fairly popular in Python--though this may also require some human intuition about how many topics one ought to create.
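On the first point above, the day-by-day pagination idea can be sketched as follows (before/after are real Pushshift parameters; the windowing helper is my own):

```python
### generate consecutive 24-hour (after, before) windows of Unix timestamps,
### which could then be plugged into successive Pushshift requests
DAY = 24 * 60 * 60

def day_windows(start_unix, n_days):
    return [(start_unix + i * DAY, start_unix + (i + 1) * DAY) for i in range(n_days)]

windows = day_windows(1546318800, 3)
print(windows)
# → [(1546318800, 1546405200), (1546405200, 1546491600), (1546491600, 1546578000)]
```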
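On the second point, here is a minimal sketch of combining per-comment polarity with upvote weights (the polarity numbers below are invented; in practice they might come from NLTK's VADER SentimentIntensityAnalyzer, and the weights from each comment's score attribute in PRAW):

```python
### each tuple: (polarity in [-1, 1], reddit score of the comment)
scored_comments = [(0.8, 120), (-0.5, 10), (0.1, 40)]

def weighted_sentiment(comments):
    """Upvote-weighted average polarity across a set of comments
    (assumes the total score is positive)."""
    total_weight = sum(score for _, score in comments)
    return sum(pol * score for pol, score in comments) / total_weight

print(round(weighted_sentiment(scored_comments), 3))
# → 0.559
```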
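On the third point, a minimal LDA sketch with scikit-learn (toy documents stand in for real comment lists, and the choice of two topics is arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

### toy "comments" -- real input would be the lists returned by get_comments
docs = ["medicare for all healthcare plan",
        "healthcare costs and medicare",
        "foreign policy and trade war",
        "trade policy with china"]

### bag-of-words counts, then a two-topic LDA fit
counts = CountVectorizer(stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

### each row of components_ is one topic's word-weight vector over the vocabulary
print(lda.components_.shape)
```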